Simple linear regression: categorical predictor (Notes)

STAT 155

Author

Your Name

Notes

Learning goals

By the end of this lesson, you should be able to:

  • Write a model formula for a simple linear regression model with a categorical predictor using indicator variables
  • Interpret the intercept and slope coefficients in a simple linear regression model with a categorical predictor

Readings and videos

Complete both the reading and the videos to go through before class.

File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.

Exercises

Context: Today we’ll explore data on thousands of diamonds to understand how physical characteristics relate to price. Read in the data below.

# Load packages and import data
library(ggplot2)
library(dplyr)

data(diamonds)

# A little bit of data wrangling code - let's not focus on this for now
diamonds <- diamonds %>% 
    mutate(
        cut = factor(cut, ordered = FALSE),
        color = factor(color, ordered = FALSE),
        clarity = factor(clarity, ordered = FALSE)
    )

For the first several exercises, our focus will be on the relationship between diamond price and cut.

Exercise 1: Get to know the data

Write R code to answer the following:

  • How many cases and variables do we have? What does a case represent?
  • What do the first few rows of the data look like?
  • Construct and interpret two different visualizations of the price variable.
  • Construct and interpret a visualization of the cut variable.

Exercise 2: Visualizations

Start by visualizing this relationship of interest, that between price and cut.

  1. The appropriate plot depends upon the type of variables we’re plotting. When exploring the relationship between a quantitative response (price) and a quantitative predictor (cut), a scatterplot was an effective choice. After running the code below, explain why a scatterplot is not effective for exploring the relationship between ridership and our categorical cut predictor.
# Try a scatterplot
ggplot(diamonds, aes(y = price, x = cut)) + 
    geom_point()

  1. Separately run each chunk below, with two plots. Comment (#) on what changes in the code / output.
# Univariate boxplot
ggplot(diamonds, aes(y = price)) + 
    geom_boxplot()

# ???
ggplot(diamonds, aes(y = price, x = cut)) + 
    geom_boxplot()

# Univariate density plot
ggplot(diamonds, aes(x = price)) + 
    geom_density()

# ???
ggplot(diamonds, aes(x = price, color = cut)) + 
    geom_density()

# Univariate histogram
ggplot(diamonds, aes(x = price)) + 
    geom_histogram()

# ???
ggplot(diamonds, aes(x = price)) + 
    geom_histogram() + 
    facet_wrap(~ cut)

  1. Do you notice anything interesting about the relationship between price and cut? What do you think might be happening here?

Exercise 3: Numerical summaries

Let’s follow up our plots with some numerical summaries.

  1. To warm up, first calculate the mean price across all diamonds.
diamonds %>% 
    ___(mean(___))
## Error in parse(text = input): <text>:2:6: unexpected input
## 1: diamonds %>% 
## 2:     __
##         ^
  1. To summarize the trends we observed in the grouped plots above, we can calculate the mean price for each type of cut. This requires the inclusion of the group_by() function:
# Calculate mean price by cut
diamonds %>% 
    group_by(cut) %>% 
    ___(mean(___))
## Error in parse(text = input): <text>:4:6: unexpected input
## 3:     group_by(cut) %>% 
## 4:     __
##         ^
  1. Examine the group mean measurements, and make sure that you can match these numbers up with what you see in the plots.

  2. Based on the results above, we can see that, on average, diamonds with a “Fair” cut tend to cost more than higher-quality cuts. Let’s construct a new variable named cutFair, using on the following criteria:

  • cutFair = 1 if the diamond is of Fair cut
  • cutFair = 0 otherwise (any other value of cut (Good, Very Good, Premium, Ideal))

The ifelse function allows to create a new variable from an existing one, based on whether or not the values in that variable meet a certain “condition” (remember, you can always look up function documentation in R by typing ?ifelse in the Console, and hitting enter!).

Fill in the following code to create cutFair. The condition was given to you already. Try to use this to complete the code.

# In the first blank, put what value cutFair should have if the condition is "met", or TRUE
# In the second blank, put what value cutFair should have if the condition is "not met", or FALSE
diamonds <- diamonds %>%
  mutate(cutFair=ifelse(cut == "Fair", ___, ___))
## Error in parse(text = input): <text>:4:41: unexpected input
## 3: diamonds <- diamonds %>%
## 4:   mutate(cutFair=ifelse(cut == "Fair", __
##                                            ^

Variables like cutFair that are coded as 0/1 to numerically indicate if a categorical variable is at a particular state are known as an indicator variable. You will sometimes see these referred to as a “binary variable” or “dichotomous variable”; you may also encounter the term “dummy variable” in older statistical literature.

  1. Now, let’s calculate the group means based on the new cutFair indicator variable:
diamonds %>% 
    group_by(cutFair) %>% 
    summarize(mean(price))
## Error in `group_by()`:
## ! Must group by variables found in `.data`.
## ✖ Column `cutFair` is not found.

Exercise 4: Modeling trend using a categorical predictor with exactly 2 categories

Next, let’s model the trend in the relationship between the cutFair and price variables using a simple linear regression model:

# Construct the model
diamond_mod0 <- lm(price ~ cutFair, data = diamonds)
## Error in eval(predvars, data, env): object 'cutFair' not found

# Summarize the model
coef(summary(diamond_mod0))
## Error: object 'diamond_mod0' not found

Compare these results to the output of exercise 3e. What do you notice? How do you interpret the intercept and cutFair coefficient terms from this model?

your answer here

Exercise 5: Modeling trend using a categorical predictor with >2 categories

Using a single binary predictor like the cutFair indicator variable is useful when there are two clearly delineated categories. However, the cut variable actually contains 5 categories! Because we’ve collapsed all non-Fair classifications into a single category (i.e. cutFair = 0), the model above can’t tell us anything about the difference in expected price between, say, Premium and Ideal cuts. The good news is that it is very straightforward to model categorical predictors with >2 categories. We can do this by using the cut variable as our predictor:

# Construct the model
diamond_mod <- lm(price ~ cut, data = diamonds)

# Summarize the model
coef(summary(diamond_mod))
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  4358.7578   98.78795 44.122361 0.000000e+00
## cutGood      -429.8933  113.84940 -3.775982 1.595493e-04
## cutVery Good -376.9979  105.16422 -3.584849 3.375707e-04
## cutPremium    225.4999  104.39521  2.160060 3.077240e-02
## cutIdeal     -901.2158  102.41155 -8.799943 1.408406e-18
  1. Even though we specified a single predictor variable in the model, we are seeing 4 coefficient estimates–why do you think this is the case?

your answer here


NOTE: We see 4 indicator variables (for Good, Very Good, Premium, and Ideal), but we do not see cutFair in the model output. This is because Fair is the reference level of the cut variable (it’s first alphabetically). You’ll see below that it is, indeed, still in the model. You’ll also see why the term “reference level” makes sense!


  1. After examining the summary table output from the code chunk above, complete the model formula:


E[price | cut] = ___ +/- ___ cutGood +/- ___ cutVery Good +/- ___ cutPremium +/- ___ cutIdeal

 

Exercise 6: Making sense of the model

Recall our model: E[price | cut] = 4358.7578 - 429.8933 cutGood - 376.9979 cutVery Good + 225.4999 cutPremium - 901.2158 cutIdeal

  1. Use the model formula to calculate the expected/typical price for diamonds of Good cut.

  2. Similarly, calculate the expected/typical price for diamonds of Fair cut.

  3. Re-examine these 2 calculations. Where have you seen these numbers before?!

Exercise 7: Interpreting coefficients

Recall that our model formula is not a formula for a line. Thus we can’t interpret the coefficients as “slopes” as we have before. Taking this into account and reflecting upon your calculations above…

  1. Interpret the intercept coefficient (4358.7578) in terms of the data context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases.

  2. Interpret the cutGood and cutVery Good coefficients (-429.8933 and -376.9979) in terms of the data context. Hint: where did you use these value in the prediction calculations above?

Exercise 8: Modeling choices (CHALLENGE)

Why do we fit this model in this way (using 4 indicator variables cutGood, cutVery Good, cutPremium, cutIdeal)? Instead, suppose that we created a single variable cutCat that gave each category a numerical value: 0 for Fair, 1 for Good, 2 for Very Good, 3 for Premium, and 4 for Ideal.

How would this change things? What are the pros and cons of each approach?

Reflection

Through the exercises above, you learned how to build and interpret models that incorporate a categorical predictor variable. For the benefit of your future self, summarize how one can interpret the coefficients for a categorical predictor.

Response: Put your response here.

Render your work

  • Click the “Render” button in the menu bar for this pane (blue arrow pointing right). This will create an HTML file containing all of the directions, code, and responses from this activity. A preview of the HTML will appear in the browser.
  • Scroll through and inspect the document to check that your work translated to the HTML format correctly.
  • Close the browser tab.
  • Go to the “Background Jobs” pane in RStudio and click the Stop button to end the rendering process.
  • Navigate to your “Activities” subfolder within your “STAT155” folder and locate the HTML file. You can open it again in your browser to double check.

Additional Practice

Exercise 9: Diamond color

Consider modeling price by color.

  • Before creating a visualization that shows the relationship between price and color, write down what you expect the plot to look like. Then construct and interpret an apporpriate plot.
  • Compute the average price for each color.
  • Fit an appropriate linear model with lm() and display a short summary of the model.
  • Write out the model formula from the above summary.
  • Which color is the reference level? How can you tell from the model summary?
  • Interpret the intercept and two other coefficients from the model in terms of the data context.

Exercise 10: Diamond clarity

If you want more practice, repeat the steps from Exercise 8 for the clarity variable.

Done!

  • Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
  • Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
  • Check the solutions in the course website, at the bottom of the corresponding chapter.
  • Work on homework!